20 research outputs found

    Put three and three together: Triangle-driven community detection

    Community detection has arisen as one of the most relevant topics in the field of graph data mining due to its applications in many fields such as biology, social networks, or network traffic analysis. Although the existing metrics used to quantify the quality of a community work well in general, under some circumstances they fail to capture this notion correctly. The main reason is that these metrics treat the internal community edges as a set, but ignore how those edges actually connect the vertices of the community. We propose Weighted Community Clustering (WCC), a new community metric that takes the triangle, instead of the edge, as the minimal structural motif indicating the presence of a strong relation in a graph. We theoretically analyse WCC in depth and formally prove, by means of a set of properties, that maximizing WCC guarantees communities with cohesion and structure. In addition, we propose Scalable Community Detection (SCD), a community detection algorithm based on WCC that is designed to be fast and scalable on SMP machines, and we show experimentally on real datasets that WCC correctly captures the concept of community in social networks. Finally, using ground-truth data, we show that SCD provides better quality than the best state-of-the-art disjoint community detection algorithms while running faster.
    Peer Reviewed. Postprint (author's final draft).
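As a rough illustration of the triangle-driven idea (not the paper's exact WCC formula), the sketch below scores a candidate community by the fraction of its vertices that close at least one triangle entirely inside it; the graph, the function names, and the scoring rule are all illustrative assumptions.

```python
from itertools import combinations

def triangles_of(v, adj):
    """Count triangles incident to vertex v in an undirected graph
    given as a dict mapping each vertex to its set of neighbours."""
    nbrs = adj[v]
    return sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])

def triangle_cohesion(community, adj):
    """Toy WCC-like proxy: fraction of a community's vertices that
    close at least one triangle fully inside the community."""
    if not community:
        return 0.0
    # Restrict adjacency to edges internal to the community.
    sub = {v: adj[v] & community for v in community}
    closed = sum(1 for v in community if triangles_of(v, sub) > 0)
    return closed / len(community)

# Toy graph: a triangle {0, 1, 2} plus a pendant vertex 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(triangle_cohesion({0, 1, 2}, adj))  # every member closes a triangle
print(triangle_cohesion({2, 3}, adj))     # a lone edge closes none
```

Under this proxy, a community held together only by isolated edges scores zero even if it is densely connected to the outside, which is the intuition the abstract attributes to triangle-based metrics.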

    A machine learning approach for factoid question answering

    This paper presents a factoid Question Answering system that is fully based on machine learning. Our system achieves results similar to a state-of-the-art QA system whose answer extraction rules were developed by a human expert. Our approach avoids human intervention and simplifies adapting the system to new environments or extended feature sets. Moreover, its response time is suitable for settings where real-time Question Answering is needed.
    David Domínguez is supported by a grant from the Generalitat de Catalunya (2005FI00437). Mihai Surdeanu is a research fellow within the Ramón y Cajal program of the Spanish Ministry of Education and Science. DAMA-UPC wants to thank the Generalitat de Catalunya for its support through grant GRE-00352.
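A minimal sketch of the core idea of learned answer extraction: rank candidate answers with a trained model instead of hand-written rules. The linear scorer, the weights, and the feature values below are all invented for illustration; a real system would learn the weights from training data.

```python
def score(features, weights):
    """Linear score of a candidate's feature vector (illustrative)."""
    return sum(f * w for f, w in zip(features, weights))

def best_answer(candidates, weights):
    """Pick the highest-scoring candidate answer, replacing
    expert-written extraction rules with a learned ranking."""
    return max(candidates, key=lambda c: score(c["features"], weights))

# Assumed, purely illustrative weights and features.
weights = [1.5, 0.8, -0.3]
candidates = [
    {"text": "Paris",  "features": [1.0, 0.9, 0.1]},
    {"text": "London", "features": [0.2, 0.4, 0.0]},
]
print(best_answer(candidates, weights)["text"])
```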

    High quality, scalable and parallel community detection for large real graphs

    Community detection has arisen as one of the most relevant topics in the field of graph mining, principally for its applications in domains such as social or biological network analysis. Different community detection algorithms have been proposed during the last decade, approaching the problem from different perspectives. However, existing algorithms are, in general, based on complex and expensive computations, making them unsuitable for large graphs with millions of vertices and edges, such as those usually found in the real world. In this paper, we propose a novel disjoint community detection algorithm called Scalable Community Detection (SCD). By combining different strategies, SCD partitions the graph by maximizing the Weighted Community Clustering (WCC), a recently proposed community detection metric based on triangle analysis. Using real graphs with ground-truth overlapping communities, we show that SCD outperforms the current state-of-the-art proposals (even those aimed at finding overlapping communities) in terms of quality and performance. SCD provides the speed of the fastest algorithms together with the quality, in terms of NMI and F1 score, of the most accurate state-of-the-art proposals. We show that SCD can run up to two orders of magnitude faster than practical existing solutions by exploiting the parallelism of current multi-core processors, enabling us to process graphs of unprecedented size in short execution times.
    Peer Reviewed. Postprint (published version).

    Massive query expansion by exploiting graph knowledge bases for image retrieval

    Annotation-based techniques for image retrieval suffer from sparse and short image textual descriptions. Moreover, users are often unable to describe their needs with the most appropriate keywords. This situation creates a vocabulary mismatch problem that results in poor retrieval precision. In this paper, we propose a query expansion technique for queries expressed as keywords and short natural language descriptions. We present a new massive query expansion strategy that enriches queries using a graph knowledge base by identifying the query concepts and adding relevant synonyms and semantically related terms. We propose a topological graph enrichment technique that analyzes the network of relations among the concepts and suggests semantically related terms through path and community detection analysis of the knowledge graph. We perform our expansions using two versions of Wikipedia as the knowledge base, improving the system's precision by more than 27%. Copyright 2014 ACM.
    Peer Reviewed.
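A much-simplified sketch of graph-based query expansion: add terms that are direct neighbours of the query concepts in a knowledge graph. The miniature knowledge base, the one-hop rule, and the expansion cap are illustrative assumptions, not the paper's topological path/community analysis.

```python
# Hypothetical miniature knowledge graph: term -> related terms.
kb = {
    "car": {"automobile", "vehicle"},
    "automobile": {"car", "engine"},
    "beach": {"coast", "sand"},
}

def expand_query(terms, kb, max_extra=3):
    """Enrich a keyword query with terms directly related in the
    knowledge graph (one-hop expansion; a simplified sketch)."""
    expanded = list(terms)
    seen = set(terms)
    for t in terms:
        for rel in sorted(kb.get(t, ())):  # sorted for determinism
            if rel not in seen and len(expanded) - len(terms) < max_extra:
                expanded.append(rel)
                seen.add(rel)
    return expanded

print(expand_query(["car", "beach"], kb))
```

The expanded term list would then be issued to the retrieval engine in place of the original sparse query, which is the mechanism that mitigates vocabulary mismatch.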

    Cache-aware load balancing vs. cooperative caching for distributed search engines

    In this paper we study the performance of a distributed search engine from a data caching point of view. We compare and combine two different approaches to achieve better hit rates: (a) send the queries to the node that currently has the related data in its local memory (cache-aware load balancing), and (b) send the cached contents to the node where a query is currently being processed (cooperative caching). Furthermore, we study the best scheduling points in the query computation at which queries can be reassigned to another node, and how this reassignment should be performed. Our analysis is guided by statistical tools applied to a real question answering system for several query distributions typically found in query logs.
    Peer Reviewed.
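The cache-aware routing decision can be sketched as a two-step rule: prefer a node that already caches the query's data, and otherwise fall back to load balancing. The node records and the least-loaded fallback below are illustrative assumptions, not the paper's scheduler.

```python
def route_query(query, nodes):
    """Send a query to the node that already caches its result, if any
    (cache-aware load balancing); otherwise pick the least-loaded node."""
    for node in nodes:
        if query in node["cache"]:
            return node["name"]
    return min(nodes, key=lambda n: n["load"])["name"]

nodes = [
    {"name": "n0", "cache": {"q1"}, "load": 3},
    {"name": "n1", "cache": set(), "load": 1},
]
print(route_query("q1", nodes))  # cached on n0, routed there
print(route_query("q2", nodes))  # no cache hit, least-loaded n1
```

Cooperative caching would instead leave the routing unchanged and ship the cached entry from `n0` to whichever node received the query; the paper combines and compares both strategies.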

    Social based layouts for the increase of locality in graph operations

    Graphs provide a natural data representation for analyzing the relationships among entities in many application areas. Since the analysis algorithms perform memory-intensive operations, it is important that the graph layout is adapted to take advantage of the memory hierarchy. Here, we propose layout strategies based on community detection to improve the in-memory data locality of generic graph algorithms. We conclude that detecting the communities in a graph provides a layout strategy that improves the performance of graph algorithms consistently over other state-of-the-art strategies.
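The layout idea can be sketched as a vertex relabelling: once communities are known, give the members of each community consecutive IDs so their data ends up contiguous in memory and traversals within a community stay cache-friendly. The function name and the sorted tie-breaking are illustrative assumptions.

```python
def community_layout(communities):
    """Assign consecutive vertex IDs within each community so that a
    community's data is contiguous in memory (a locality sketch)."""
    mapping, next_id = {}, 0
    for com in communities:
        for v in sorted(com):  # deterministic order inside a community
            mapping[v] = next_id
            next_id += 1
    return mapping

# Two communities; members of each receive adjacent new IDs.
layout = community_layout([{5, 9, 2}, {7, 0}])
print(layout)
```

An analysis algorithm would then store vertex data in arrays indexed by the new IDs, so that intra-community edges, which dominate in a good partition, touch nearby memory.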

    Cooperative cache analysis for distributed search engines

    In this paper, we study the performance of a distributed search engine from a data caching point of view, using statistical tools on a varied set of configurations. We study two strategies to achieve better performance: cache-aware load balancing, which issues queries to the nodes that hold the corresponding computation in their cache; and cooperative caching (CC), which stores and transfers the available computed contents from one node in the network to others. Since cache-aware decisions depend on information about the recent history, we also analyse how the ageing of this information impacts system performance. Our results show that the combination of both strategies yields better throughput than implementing either cooperative caching or cache-aware load balancing individually, because of a synergistic improvement of the hit rate. Furthermore, the analysis concludes that the data structures that monitor the system need only moderate precision to achieve optimal throughput.
    Peer Reviewed. Postprint (published version).

    Two-way replacement selection

    The performance of external sorting using merge sort is highly dependent on the length of the runs generated. One of the most commonly used run generation strategies is Replacement Selection (RS) because, on average, it generates runs that are twice the size of the available memory. However, RS generates shorter runs for data with certain characteristics, such as inputs sorted inversely with respect to the desired output order. The goal of this paper is to propose and analyze two-way replacement selection (2WRS), a generalization of RS that implements two heaps instead of the single heap used by RS. Appropriate management of these two heaps allows generating runs larger than the available memory in a stable way, i.e., independently of the characteristics of the dataset. Depending on the changing characteristics of the input dataset, 2WRS assigns each new data record to one heap or the other, and grows or shrinks each heap to accommodate the increasing or decreasing tendency of the data. On average, 2WRS creates runs at least as long as those generated by RS, and longer for datasets that combine increasing and decreasing subsequences. We tested both algorithms on large datasets with different characteristics; 2WRS achieves speedups at least similar to RS, and of over 2.5 when RS fails to generate long runs.
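The baseline the abstract builds on can be sketched directly: classic single-heap Replacement Selection tags each record with a run number, and a record smaller than the last output is deferred to the next run. This is the RS baseline only, under assumed names, not the paper's two-heap 2WRS, which would additionally maintain a second heap for descending stretches of the input.

```python
import heapq

def replacement_selection(stream, memory=3):
    """Classic single-heap Replacement Selection (the RS baseline).
    Returns the list of sorted runs generated for the input stream."""
    # Heap entries are (run_id, key): the heap orders by run first,
    # so records deferred to the next run sink below the current one.
    heap = [(0, x) for x in stream[:memory]]
    heapq.heapify(heap)
    rest = iter(stream[memory:])
    runs, current, run_id = [], [], 0
    while heap:
        rid, x = heapq.heappop(heap)
        if rid != run_id:          # heap switched to the next run
            runs.append(current)
            current, run_id = [], rid
        current.append(x)
        nxt = next(rest, None)
        if nxt is not None:
            # Smaller than the last output: must wait for the next run.
            heapq.heappush(heap, (rid if nxt >= x else rid + 1, nxt))
    runs.append(current)
    return runs

print(replacement_selection([5, 1, 7, 3, 9, 2], memory=3))
```

On an inversely sorted input every incoming record is smaller than the last output, so RS degrades to memory-sized runs; that is exactly the case where the paper reports 2WRS winning by over 2.5x.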